Abstract

The fundamental aim of clustering algo- 
rithms is to partition data points. We con- 
sider tasks where the discovered partition is 
allowed to vary with some covariate such as 
space or time. One approach would be to 
use fragmentation-coagulation processes, but 
these, being Markov processes, are restricted 
to linear or tree structured covariate spaces. 
We define a partition-valued process on an 
arbitrary covariate space using Gaussian pro- 
cesses and a novel interpretation of the stick 
breaking construction. By choosing the pa- 
rameters of the stick breaking construction 
the process can be given Chinese restaurant 
process or Pitman Yor process marginals. We 
use the process to construct a multitask clus- 
tering model which partitions datapoints in a 
similar way across multiple data sources, and 
a time series model of network data which al- 
lows cluster assignments to vary over time. 
We use Elliptical Slice Sampling for infer- 
ence and apply our method to defining can- 
cer subtypes based on different types of cel- 
lular characteristics, finding regulatory mod- 
ules from gene expression data from multi- 
ple human populations, and discovering time 
varying community structure in a social net- 
work. 
1. Introduction

We are interested in problems where the partitioning 
of data into groups depends on some covariate. As 
a simple example, consider how the partitioning of a 
number of people into friendship groups might evolve 
over time (the covariate in this case). At proximal 
times people will tend to form similar partitions, but 
at more distant times the partitioning might be quite 
different. One of our examples in Section 9 will involve 
modelling such data. We define a nonparametric pro-
cess that induces dependency between partitions on a
covariate space. 

Many nonparametric processes studied in the liter- 
ature, such as the Dirichlet process (DP, Ferguson, 
1973), are distributions over the space of measures. 
The DP is used to construct a distribution over the 
space of partitions known as the Chinese restaurant 
process (CRP, Aldous, 1983). Dependent nonpara- 
metric processes extend distributions over measures 
and partitions to give distributions over collections of 
measures or partitions indexed by locations in some 
covariate space (MacEachern, 1999). Covariate spaces 
include $\mathbb{R}^+$ (e.g. continuous time), $\mathbb{Z}$ (e.g. discrete
time), or $\mathbb{R}^d$ (e.g. geographical location). Most of
the dependent nonparametric processes explored in 
the literature define distributions over collections of 
measures. Dependency among the measures on each 
location may then be induced in various ways. The 
single-p Dependent Dirichlet Process (DDP, MacEach- 
ern, 1999) assumes a shared set of atom weights at each 
covariate index and induces dependency by allowing 
the corresponding atom locations to vary according to 
some stochastic process, e.g. a Gaussian process (GP, 
Rasmussen & Williams, 2006). If the covariate space 
is time, the process defines a nonparametric mixture 
model at each index and the characteristics of each
component evolve over time. The multiple-p DDP by 
MacEachern (2000) allows the same atom locations at 
each covariate index but the weights of each atom 
are dependent across the indices. This is achieved 
by replacing the beta-distributed random variables in 
the stick-breaking construction with samples from a 
stochastic process evolving over the covariate space. 

By de Finetti's theorem, any exchangeable sequence is 
equivalent to i.i.d. draws from a conditional distribu- 
tion and can be written as a mixture of such distribu- 
tions. Both the single-p and multiple-p DDP have the 
DP as their de Finetti mixing measure and introduce 
dependency in the mixing distribution either through 
the atom locations or weights. The generalized spa-
tial Dirichlet process by Duan et al. (2005) induces
dependency by assuming that the mixing distribution 
is common to all covariate indices, but the conditional 
distributions are correlated. It defines a DP mixture 
model of Gaussian random fields at each covariate in- 
dex. 

Despite this long line of research into dependent
measure-valued processes, little attention has been
given to dependent partition-valued processes. A sam-
ple from such a process is a collection of partitions in-
dexed by the covariate: at any single covariate loca-
tion, there is a single partition. An exception is Teh
et al. (2011), where the duality between Kingman's 
coalescent (KC, Kingman, 1982) and the Dirichlet dif- 
fusion tree (DDT, Neal, 2003a) is leveraged to define 
a "Fragmentation-Coagulation" process (FCP) which 
is Markov, stationary, exchangeable and has CRP dis- 
tributed marginals (Bertoin, 2006). The FCP defines 
a distribution over a collection of partitions on a one 
dimensional covariate space. The partitioning at each 
covariate location is a result of fragmentation (accord- 
ing to the DDT) and coagulation (according to the 
KC) events that take place between adjacent covari- 
ate locations. Although mathematically elegant, due 
to its Markov construction it is not clear how to ex- 
tend the FCP to an arbitrary covariate space. In this 
paper, we derive a dependent partition-valued process 
on an arbitrary covariate space which, like the FCP, is 
exchangeable and has CRP distributed marginals. For 
brevity, we refer to this process as the DPVP, for "Dependent
Partition-Valued Process".

The DPVP is closely related to the dependent IBP 
(dIBP, Williamson et al., 2010), which addresses the 
problem of modelling dependence for binary latent fea- 
ture models. Coupling over the covariate, in both pro- 
cesses, is achieved by representing Bernoulli variables 
at each covariate index as transformed Gaussian vari-
ables and aggregating these into Gaussian processes
over the covariate. The dIBP generates a set of binary
feature matrices evolving over the covariate, while the 
DPVP couples a set of CRPs. 

We use the DPVP to construct two distinct models. 
The first is a multitask clustering model (MCM) which 
attempts to find similar partitions of objects across 
distinct data views. Our approach learns the similar- 
ity between the clustering in each data source. MCM 
is closely related to the Multiple Dataset Integration 
model (MDI, Kirk et al., 2012), where the conditional 
probability of allocating a sample to a cluster in one 
data source is influenced by the assignments in other 
data sources. Like MCM, MDI can learn how similar 
the clusterings should be across different data sources 
using positive real-valued parameters for each pair of 
data sources. However, MCM is a valid generative 
model unlike MDI which is defined only in terms of 
these conditional distributions. Both models can be 
considered as using finite approximations to a DP mix- 
ture model at each location (data source): where MDI 
simply uses a large finite Dirichlet distribution, MCM
uses the stick breaking construction of the DP (Ish- 
waran & James, 2001b). Another distinction between 
the two approaches lies in the way the dependency 
across the partitions is induced. The MDI model 
introduces a positive real-valued parameter for each 
pair of data sources that describes their (dis)similarity, 
whereas MCM uses Gaussian processes to induce de- 
pendency between the partitions across the covariate 
space. 

The outline of this paper is as follows. In Section 2, 
we briefly provide some background on partitions, the 
CRP and the stick breaking construction. In Section 
3, we present the dependent partition-valued process
and in Sections 4 and 5 describe how to use this pro-
cess to build a multitask clustering model and a model
for time evolving community structure respectively. In
Section 6, we describe how different choices of kernel
may be used for different applications. Inference in
our model is performed via an MCMC sampler, which is
described in Section 7. In Sections 8 and 9 we describe
case studies for multitask clustering and network mod-
elling. Finally, in Section 10 we conclude our work and
discuss some future directions. 

2. Background 

A partition of [N] = {1, .., N} is a set of disjoint non- 
empty subsets of [N] such that the union of these sub- 
sets is [N]. In clustering applications we refer to each 
subset as a "cluster". The set of partitions of [N]
is denoted $\Pi_N$. The most natural distribution over
$\Pi_N$ is the Chinese restaurant process (CRP). Since
the CRP is exchangeable (invariant to permutations of 
[N]) and projective (consistent under marginalisation 
of elements), by de Finetti's theorem it can be repre- 
sented using i.i.d. samples from a random measure. 
For the CRP this is the Dirichlet process (DP). The 
weights of the DP can be represented using a "stick 
breaking" construction as follows [Sethuraman (1994); 
Ishwaran & James (2001a)]: 

$$\pi_k = v_k \prod_{l=1}^{k-1} (1 - v_l) \quad \forall k \in \mathbb{Z}^+$$

$$v_k \sim_{iid} \mathrm{Beta}(1, \alpha) \quad \forall k \in \mathbb{Z}^+ \qquad (1)$$

In the above, the $\pi_k$'s are mixture proportions (stick
lengths) and $\alpha > 0$ is known as the concentration pa-
rameter. If we now consider sampling cluster assign-
ments $c_n \mid \pi \sim \mathrm{Multinomial}(\pi)$ for $n \in [N]$, then the
corresponding partition $\gamma$ of [N] is marginally CRP
distributed. The partition is obtained from the cluster
assignments $c$ by putting $n$ and $m$ in the same subset
iff $c_n = c_m$.
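The construction above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper: it assumes a finite truncation at `K` sticks (the last stick absorbs the remaining mass), and the function names are hypothetical.

```python
import random

def stick_breaking_weights(alpha, K, rng):
    """Truncated stick breaking: returns K weights that sum to 1."""
    weights, remaining = [], 1.0
    for _ in range(K - 1):
        v = rng.betavariate(1.0, alpha)   # v_k ~ Beta(1, alpha)
        weights.append(v * remaining)     # pi_k = v_k * prod_{l<k} (1 - v_l)
        remaining *= 1.0 - v
    weights.append(remaining)             # truncation: last stick takes the rest
    return weights

def partition_from_assignments(c):
    """Put n and m in the same block iff c[n] == c[m]."""
    blocks = {}
    for n, label in enumerate(c):
        blocks.setdefault(label, set()).add(n)
    return sorted(blocks.values(), key=min)

rng = random.Random(0)
pi = stick_breaking_weights(alpha=1.0, K=20, rng=rng)
c = rng.choices(range(len(pi)), weights=pi, k=10)   # c_n | pi ~ Multinomial(pi)
partition = partition_from_assignments(c)
```

Only the induced partition matters for the CRP marginal; the cluster labels themselves carry no meaning.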

3. A dependent partition-valued 
process 

Given some covariate space $\mathcal{T}$, our aim is to con-
struct an exchangeable, projective, dependent process
$(\gamma(t), t \in \mathcal{T})$ such that each $\gamma(t)$ is a random partition
which is marginally CRP distributed. To achieve this 
we will use the following representation of the stick- 
breaking construction (Equation 1): 

For each object $n \in [N]$:

1. Set $k := 1$.

2. Sample $a_{nk} \sim \mathrm{Bernoulli}(v_k)$.

3. If $a_{nk} = 1$ then assign $c_n := k$, else increment $k$ and
go to 2.

It is straightforward to see that the probability of
choosing cluster $k$ is given by the stick length $\pi_k$ since

$$P(c_n = k \mid v) = P(a_{n1} = 0 \mid v) \cdots P(a_{n(k-1)} = 0 \mid v) \, P(a_{nk} = 1 \mid v) = (1 - v_1) \cdots (1 - v_{k-1}) \, v_k = \pi_k \qquad (2)$$
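As a quick numerical check of this identity, the sequential Bernoulli procedure can be simulated and compared against the stick lengths. The sketch below is illustrative only: it uses 0-based cluster indices, a hypothetical three-stick vector `v`, and returns an extra index for objects that fall past the last represented stick.

```python
import random

def assign_cluster(v, rng):
    """Walk down the sticks and stop at the first success a_nk = 1."""
    for k, v_k in enumerate(v):
        if rng.random() < v_k:      # a_nk ~ Bernoulli(v_k)
            return k
    return len(v)                   # fell past the last represented stick

def stick_lengths(v):
    """pi_k = v_k * prod_{l<k} (1 - v_l)."""
    out, remaining = [], 1.0
    for v_k in v:
        out.append(v_k * remaining)
        remaining *= 1.0 - v_k
    return out

rng = random.Random(1)
v = [0.5, 0.4, 0.9]
draws = [assign_cluster(v, rng) for _ in range(20000)]
empirical = [draws.count(k) / len(draws) for k in range(len(v))]
expected = stick_lengths(v)         # approximately [0.5, 0.2, 0.27]
```

The empirical frequencies should agree with the stick lengths up to Monte Carlo error.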

Following Williamson et al. (2010), we note that the 
Bernoulli random variable $a_{nk}$ can be represented as

$$f_{nk} \sim \mathcal{N}(0, \sigma^2)$$

$$a_{nk} = \mathbb{I}[f_{nk} < \Phi^{-1}(v_k \mid 0, \sigma^2)] \qquad (3)$$

where $\Phi(\cdot \mid \mu, \sigma^2)$ is the Gaussian cumulative distribu-
tion function with mean $\mu$ and variance $\sigma^2$. We now
extend these random variables to random functions on
$\mathcal{T}$ and introduce dependency by extending the Gaus-
sian prior on each $f_{nk}$ to a Gaussian process (GP) prior
on each $f_{nk}(t)$ with $t \in \mathcal{T}$:

$$f_{nk}(t) \sim_{iid} \mathcal{GP}(0, \Sigma(t, t')) \quad \forall n \in [N],\ k \in \mathbb{Z}^+$$

$$a_{nk}(t) = \mathbb{I}[f_{nk}(t) < \Phi^{-1}(v_k \mid 0, \Sigma(t, t))] \qquad (4)$$

where $\Sigma(t, t')$ is the covariance function, which we as-
sume to be common to all the GPs. As a result of
the marginalisation properties of GPs, each $f_{nk}(t)$ is
marginally $\mathcal{N}(0, \Sigma(t, t))$ distributed, so that $a_{nk}(t)$ is
marginally Bernoulli($v_k$) distributed, and the result-
ing partition $\gamma(t)$ is marginally CRP distributed, as
desired.

To summarise, the DPVP generative process is 

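$$v_k \sim_{iid} \mathrm{Beta}(1, \alpha) \quad \forall k \in \mathbb{Z}^+$$

$$f_{nk} \sim_{iid} \mathcal{GP}(0, \Sigma) \quad \forall n \in [N],\ k \in \mathbb{Z}^+$$

$$c_n(t) = \min \{ k : f_{nk}(t) < \Phi^{-1}(v_k \mid 0, \Sigma(t, t)) \} \qquad (5)$$

We do not make the $v_k$ a function on $\mathcal{T}$ but instead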
assume global mixing proportions: relaxing this con- 
straint would be a straightforward extension. 

In practice, we cannot represent a countably infinite 
set of GPs. While we could adaptively extend our rep- 
resentation as required, we instead choose the simpler 
option of truncating the stick breaking construction at 
some level $K$. If $a_{nk}(t) = 0$ for all $k < K$, then we set
$c_n(t) = K$. To sample from the $K$-truncated DPVP
at $T$ locations $\{t_\tau \in \mathcal{T} \mid \tau = 1, \ldots, T\}$ we therefore re-
quire $K - 1$ $T$-vectors $f_{nk} = [f_{nk}(t_1) \ldots f_{nk}(t_T)]$, for
each object $n$, from a Gaussian process with a $T \times T$
Gram (covariance) matrix, $\Sigma$, where $\Sigma_{\tau\tau'} = \Sigma(t_\tau, t_{\tau'})$.
For notational simplicity, we concatenate the $f$ vectors
into a $T \times N(K - 1)$ matrix, $F$. By drawing from a
GP as in Equation 4, we introduce dependency among the parti-
tions in different covariate locations. The dependence
is defined by the covariance function of the GP, $\Sigma$, and
introduces similarity between the partitions at proxi- 
mal covariate locations. 
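The truncated generative process can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it uses only the Python standard library, assumes a positive definite Gram matrix with unit diagonal (so the thresholds are standard normal quantiles), uses 0-based cluster labels, and the names `cholesky` and `sample_dpvp` are hypothetical.

```python
import math
import random
from statistics import NormalDist

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a small positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][m] * L[j][m] for m in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def sample_dpvp(alpha, Sigma, N, K, rng):
    """Return c[n][tau] in {0, ..., K-1} for a K-truncated DPVP."""
    T = len(Sigma)
    L = cholesky(Sigma)
    v = [rng.betavariate(1.0, alpha) for _ in range(K - 1)]
    thresholds = [NormalDist().inv_cdf(v_k) for v_k in v]   # Phi^{-1}(v_k | 0, 1)
    c = []
    for _ in range(N):
        assignment = [K - 1] * T          # default: last cluster if no stick is hit
        decided = [False] * T
        for k in range(K - 1):
            z = [rng.gauss(0.0, 1.0) for _ in range(T)]
            # f_nk = L z is one draw from N(0, Sigma) at the T locations
            f = [sum(L[i][j] * z[j] for j in range(i + 1)) for i in range(T)]
            for tau in range(T):
                if not decided[tau] and f[tau] < thresholds[k]:
                    assignment[tau], decided[tau] = k, True
        c.append(assignment)
    return c

Sigma = [[1.0, 0.9], [0.9, 1.0]]          # two locations, strongly correlated
c = sample_dpvp(alpha=1.0, Sigma=Sigma, N=200, K=10, rng=random.Random(0))
agreement = sum(row[0] == row[1] for row in c) / len(c)
```

As the off-diagonal entry approaches 1 the two function values nearly coincide, so the assignments at the two locations agree for almost every object.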

We denote a sample from the DPVP as 
DPVP($\alpha$, t, $\Sigma$), where $\alpha$ is the concentration
parameter of the underlying DP, t is the vector of
covariate locations and $\Sigma$ is the Gram matrix. We
now use this process to construct two models: a 
multitask clustering model and a network model that 
allows the discovered community structure to vary 
through time. The graphical model for both is shown 
in Figure 1. 
Figure 1. Graphical model for both the Multitask Clus-
tering Model (MCM) and Evolving Community Structure
(ECS) model. c is a deterministic function of f and v
but we show both for clarity. The thick line linking the f's
denotes dependence through the Gaussian process.

4. Multitask clustering 

We are interested in the situation where we have a col- 
lection of N objects, for each of which we have mea- 
surements from T different data sources. We associate 
each data source $\tau$ with a covariate location $t_\tau \in \mathcal{T}$, al-
though this might be arbitrary. Our model will assume
that for each data source the objects are grouped into
clusters forming a partition. We allow the clustering
to be different for each data source, but model depen-
dency between these partitions using the DPVP. De-
note the data for object $n$ in data source $\tau$ as $y_n^\tau \in \mathcal{Y}^\tau$,
where we allow the observed space $\mathcal{Y}^\tau$ to be different
for each data source. The Multitask Clustering Model
(MCM) is then

$$c_n^\tau \sim \mathrm{DPVP}(\alpha, t, \Sigma)$$

$$\theta_k^\tau \sim G^\tau$$

$$y_n^\tau \mid c_n^\tau, \theta^\tau \sim F^\tau(\theta^\tau_{c_n^\tau}) \qquad (6)$$

where $\theta_k^\tau$ are cluster parameters, $G^\tau$ are priors on the
cluster parameters and $F^\tau$ are data likelihoods. In the
following we assume all the data sources are continu-
ous, i.e. $\mathcal{Y}^\tau = \mathbb{R}^{D_\tau}$, and can therefore be represented
as an $N \times D_\tau$ matrix $Y^\tau \in \mathbb{R}^{N \times D_\tau}$. We allow each
data source to have a different observed dimensionality
$D_\tau$. We use a diagonal (independent per dimension)
Gaussian likelihood and its conjugate normal-gamma
prior on the cluster parameters:

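$$(\mu_{kd}^\tau, \lambda_{kd}^\tau) \sim \mathcal{N}\left(\mu_{kd}^\tau \,\middle|\, \mu_0, \frac{1}{\kappa_0 \lambda_{kd}^\tau}\right) \mathrm{Ga}\left(\lambda_{kd}^\tau \,\middle|\, \alpha_0, \frac{1}{\beta_0}\right)$$

$$y_{nd}^\tau \mid c_n^\tau, \mu^\tau, \lambda^\tau \sim \mathcal{N}\left(\mu^\tau_{c_n^\tau d}, 1 / \lambda^\tau_{c_n^\tau d}\right) \qquad (7)$$

where we set $\kappa_0 = 0.1$, $\mu_0 = 0$, $\beta_0 = 0.1$, $\alpha_0 = 0.1$.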
By choosing the conjugate prior we are able to in- 
tegrate out the cluster parameters during inference. 
The choice of a diagonal covariance allows scaling to 
larger datasets. The generalisation to other data types 
would be straightforward using the appropriate conju- 
gate prior. 

5. A model for time evolving 
community structure 

Most models of network data assume a static network, 
whereas real world networks typically evolve over time.
We use the DPVP to build a model which discovers 
clusters in network data, but allows cluster assign- 
ments to change over time. We refer to this model 
as ECS for "Evolving Community Structure". In this
case our data $Y^\tau$ at each location $\tau$ is a binary $N \times N$
matrix representing the presence or absence of links
between objects. The assignment of the objects to
groups at each covariate index $\tau$ determines the prob-
ability of links in an analogous fashion to the Infinite
Relational Model (IRM, Kemp et al., 2006).

$$c_n^\tau \sim \mathrm{DPVP}(\alpha, t, \Sigma) \qquad (8)$$

$$\theta^\tau_{kk'} \sim \mathrm{Beta}(\beta, \beta) \qquad (9)$$

$$y^\tau_{nn'} \mid c^\tau, \theta^\tau \sim \mathrm{Bernoulli}(\theta^\tau_{c_n^\tau c_{n'}^\tau}) \qquad (10)$$

where $y^\tau_{nn'}$ denotes the presence of a link between ob-
jects $n$ and $n'$ at time $\tau$, and we set $\beta = 0.1$ to encour-
age values of $\theta$ close to 0 or 1. While we assume the link
probabilities $\theta$ are independent at every time point $\tau$, a
possible extension would be to introduce dependence
between these values at adjacent time points.
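Because the Beta prior is conjugate to the Bernoulli likelihood, the link probabilities can be integrated out in closed form, block by block. The sketch below illustrates this for a single time point. It rests on assumptions not stated in the text: ordered pairs are treated as separate observations, self-links are ignored, missing entries are marked with `None`, and the function names are hypothetical.

```python
import math
from collections import Counter

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def ecs_log_marginal(Y, c, beta=0.1):
    """log p(Y | c) for one time point with the link probabilities integrated out.

    Y is an N x N matrix of 0/1 entries (None marks a missing entry);
    c[n] is the cluster label of object n.
    """
    links, nonlinks = Counter(), Counter()
    N = len(Y)
    for n in range(N):
        for m in range(N):
            if n == m or Y[n][m] is None:
                continue
            block = (c[n], c[m])
            if Y[n][m] == 1:
                links[block] += 1
            else:
                nonlinks[block] += 1
    total = 0.0
    for block in set(links) | set(nonlinks):
        # Beta-Bernoulli marginal for one (k, k') block of the matrix
        total += log_beta(beta + links[block], beta + nonlinks[block]) - log_beta(beta, beta)
    return total

Y = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
good = ecs_log_marginal(Y, [0, 0, 1, 1])   # clusters match the link pattern
bad = ecs_log_marginal(Y, [0, 1, 0, 1])    # clusters cut across the link pattern
```

With a small $\beta$ the marginal rewards blocks that are almost entirely links or almost entirely non-links, which is why the matching clustering scores higher here.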

6. Choice of kernel 

Since the DPVP is constructed using Gaussian pro-
cesses there is great flexibility in the choice of covari-
ance function (kernel). Here we describe only the op-
tions used in the experiments in Sections 8 and 9. We choose
to fix the diagonal $\Sigma(\tau, \tau) = 1$ in all cases since the
DPVP is invariant to scaling of the covariance matrix:
by construction changing the prior marginal variance
of the GP functions at any location does not affect the
resulting distribution over partitions.

Squared exponential. In the situation where there 
is a known covariate value, such as time or spatial lo- 
cation, associated with each data source or network, 
the canonical covariance function is the squared expo- 
nential kernel 

$$\Sigma(t, t') = \exp\left(-\left(\frac{t - t'}{l}\right)^2\right)$$

where $l > 0$ is the lengthscale which controls the
smoothness of the GPs. We put an exponential prior
on $l$.
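A Gram matrix for this kernel can be built as sketched below. The exact scaling inside the exponent is an assumption: the sketch uses $\exp(-((t - t')/l)^2)$, and other conventions (for example a factor of 2 in the denominator) differ only by a rescaling of the lengthscale. The time points are those implied by the survey schedule described in Section 9.2, and the lengthscale is the average value reported there.

```python
import math

def squared_exponential_gram(t, lengthscale):
    """Gram matrix with entries exp(-((t_i - t_j) / lengthscale) ** 2)."""
    return [[math.exp(-((ti - tj) / lengthscale) ** 2) for tj in t] for ti in t]

# Seven survey times in weeks: first four two weeks apart, last four three weeks apart.
weeks = [0, 2, 4, 6, 9, 12, 15]
Sigma = squared_exponential_gram(weeks, lengthscale=1.43)
```

The diagonal is 1 by construction, which matches the unit-diagonal convention used throughout.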

Similarity kernel. In the multitask clustering set-
ting we may often have little or no prior knowledge about
which data sources are likely to have similar clustering
structure to others and would like to learn this from
the data itself. In this case we fix the diagonal terms
equal to 1 and put a Uniform[−1, 1] prior on the off-
diagonal terms. We ensure PSD matrices simply by
rejecting any non-PSD matrices during the slice sam-
pling.
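One simple way to implement the rejection step is to attempt a Cholesky factorisation and reject the proposal if it fails. The sketch below is an illustration of that idea, not the procedure used in the paper: it tests strict positive definiteness (so that a Cholesky factor exists), and the helper names and the three-source example are hypothetical.

```python
import math

def is_positive_definite(A, tol=1e-12):
    """Attempt a Cholesky factorisation; fail if any pivot is not positive."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][m] * L[j][m] for m in range(j))
            if i == j:
                if s <= tol:
                    return False
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True

def similarity_matrix(r12, r13, r23):
    """Unit-diagonal correlation matrix for three data sources."""
    return [[1.0, r12, r13], [r12, 1.0, r23], [r13, r23, 1.0]]

accepted = is_positive_definite(similarity_matrix(0.5, 0.0, 0.0))
rejected = is_positive_definite(similarity_matrix(0.9, 0.9, -0.9))
```

The second matrix is rejected because two sources cannot both be strongly correlated with a third while being strongly anti-correlated with each other.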

Tree structured covariance. In other cases we 
may have a tree structured dependency between the 
data sources: the data sources are the leaves of a tree 
which represents a known relationship. Each branch in
the tree represents a Gaussian factor: the child node 
is normally distributed with mean equal to its parent 
and variance equal to the branch length. The total 
height of the tree is 1 so that the diagonal of the re- 
sulting covariance matrix is 1. We put a uniform prior 
on the branch lengths (under the constraint that no 
branch lengths can be negative). A concrete example 
is given in Section 8.3. 
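Under this construction the covariance between two leaves is the total length of the branches they share on the path from the root, which can be computed as sketched below. The sketch assumes the root has a fixed value (zero variance), and the three-leaf tree, branch names and lengths are hypothetical.

```python
def tree_covariance(paths):
    """paths[leaf] lists (branch_name, branch_length) pairs from root to leaf."""
    leaves = list(paths)
    cov = []
    for a in leaves:
        row = []
        for b in leaves:
            shared = 0.0
            for (name_a, length), (name_b, _) in zip(paths[a], paths[b]):
                if name_a != name_b:
                    break               # paths diverge: no more shared branches
                shared += length
            row.append(shared)
        cov.append(row)
    return leaves, cov

# Hypothetical tree of height 1: leaves A and B share a branch of length 0.6.
paths = {
    "A": [("root-x", 0.6), ("x-A", 0.4)],
    "B": [("root-x", 0.6), ("x-B", 0.4)],
    "C": [("root-C", 1.0)],
}
leaves, Sigma = tree_covariance(paths)
```

Because every root-to-leaf path has total length 1, the diagonal of the resulting matrix is 1, and leaves that split closer to the root are less correlated.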

7. MCMC Inference 

In the following, we present a method for inferring
the latent variables and parameters of the MCM and ECS
models: the matrix of Gaussian process function val-
ues, $F$, the weight vector $v$ and the parameters of the
covariance matrix $\Sigma$. Exact inference is intractable,
so we develop a Markov Chain Monte Carlo (MCMC)
procedure to sample from the posterior distribution.
Each iteration is $O(DNKT + T^3)$ which for small $T$ is
the same as EM or sampling for a DPM. The cluster
assignments $c$ are not represented since these are a de-
terministic function of $F$ and $v$ (since we ensure that
$\Sigma_{\tau\tau} = 1$ in all cases the assignments do not depend on
$\Sigma$). For both models we are able to integrate out the
cluster parameters, $\theta$ (the link probabilities for ECS),
due to conjugacy:

$$p(Y^\tau \mid F^\tau, v) = \int p(Y^\tau \mid F^\tau, v, \theta^\tau) \, p(\theta^\tau) \, d\theta^\tau \qquad (11)$$

The sampler iterates as follows: 

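Sampling the GP function values, F. The con-
ditional posterior over the matrix $F$ is given by

$$p(F \mid Y, v, \Sigma) \propto \left( \prod_{\tau=1}^{T} p(Y^\tau \mid F^\tau, v) \right) \prod_{n=1}^{N} \prod_{k=1}^{K-1} \mathcal{N}(f_{nk} \mid 0, \Sigma)$$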



Since we have a GP prior over $F$ we use elliptical slice
sampling (ESS, Murray et al., 2010), which is specifi-
cally designed to sample from posteriors with strongly
correlated Gaussian priors. We find that jointly sam-
pling the $K - 1$ GPs associated with a datapoint gives
better mixing than attempting to sample all
$N(K - 1)$ GPs jointly (see Section 8.1). The covari-
ance matrix for the $K - 1$ GPs for a single datapoint is
block diagonal where each block is $\Sigma$. Naive computa-
tion of the Cholesky would be $O(K^3 T^3)$, but utilising
the block diagonal structure it is only $O(T^3)$, and $T$ is
typically relatively small in the applications we envis-
age.
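For reference, a single elliptical slice sampling update can be sketched as below, following the algorithm of Murray et al. (2010). This is a generic illustration rather than the sampler used in the paper: the prior draw here is a one-dimensional standard normal and the likelihood is a toy Gaussian, whereas in the DPVP setting the prior draw would come from the GP prior (via the Cholesky factor of $\Sigma$) and the likelihood from the collapsed data likelihood.

```python
import math
import random

def elliptical_slice(f, prior_draw, log_lik, rng):
    """One ESS update of f under a zero-mean Gaussian prior."""
    nu = prior_draw()                                   # auxiliary draw from the prior
    log_y = log_lik(f) + math.log(1.0 - rng.random())   # slice height
    theta = rng.uniform(0.0, 2.0 * math.pi)
    lo, hi = theta - 2.0 * math.pi, theta
    while True:
        proposal = [a * math.cos(theta) + b * math.sin(theta) for a, b in zip(f, nu)]
        if log_lik(proposal) > log_y:
            return proposal
        if theta < 0.0:                                 # shrink the bracket towards 0
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)

# Toy check: prior N(0, 1) and likelihood N(2 | f, 1) give posterior N(1, 0.5).
rng = random.Random(0)
prior_draw = lambda: [rng.gauss(0.0, 1.0)]
log_lik = lambda f: -0.5 * (2.0 - f[0]) ** 2
f, samples = [0.0], []
for _ in range(5000):
    f = elliptical_slice(f, prior_draw, log_lik, rng)
    samples.append(f[0])
posterior_mean = sum(samples) / len(samples)
```

The update has no step-size parameter to tune, and every call returns a new state because the bracket shrinks towards the current point, which always lies on the slice.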

Sampling the weight vector, v. The sampler suc-
cessively samples each of the $K - 1$ weights $v_k$. Since
we do not have conjugacy (due to the complex form
of the likelihood function), we cannot sample directly from
the posterior $v_k \mid Y, F, v_{-k}$ (where $v_{-k}$ denotes the val-
ues of $v$ excluding $v_k$). To overcome this problem, we
use slice sampling (Neal, 2003b) with the reparame-
terisation $g(v) = \log[v / (1 - v)]$ so that $g \in \mathbb{R}$.
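When slice sampling on the transformed variable, the target density must include the Jacobian of the logit transform. The short derivation below spells this out; the notation $p(\cdot \mid \ldots)$ for the conditional posterior is shorthand introduced here for illustration.

```latex
\begin{align*}
g &= \log\frac{v}{1 - v}, \qquad v(g) = \frac{1}{1 + e^{-g}}, \qquad \frac{dv}{dg} = v(g)\,\bigl(1 - v(g)\bigr) \\
p(g \mid \ldots) &= p\bigl(v(g) \mid \ldots\bigr)\, v(g)\,\bigl(1 - v(g)\bigr) \\
\log p(g \mid \ldots) &= \log p\bigl(v(g) \mid \ldots\bigr) + \log v(g) + \log\bigl(1 - v(g)\bigr)
\end{align*}
```

Working on the unconstrained scale means the slice sampler never has to handle the boundaries of the interval (0, 1).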

Sampling parameters of the kernel. Learning 
the parameters of the kernel is of interest because it 
tells us how similar the partitions appear to be across 
covariate space. Since we ensure $\Sigma_{\tau\tau} = 1$ the like-
lihood for $\Sigma$ is just $\prod_{n=1}^{N} \prod_{k=1}^{K-1} \mathcal{N}(f_{nk} \mid 0, \Sigma)$. In fact,
since values of $f_{nk}$ for $k > \max c$ do not affect the par-
titioning, we can trivially integrate these out so the
product over $k$ in the likelihood need only go up
to $\max c$. This both saves computation and improves
mixing by avoiding conditioning on irrelevant informa-
tion. For all three kernels described in Section 6 we
use slice sampling to learn the kernel parameters: the 
lengthscale for the squared exponential kernel, the cor- 
relation coefficients for the similarity kernel, and the 
branch lengths for the tree structured kernel. 

Sampling the concentration parameter, $\alpha$. We
also use slice sampling to infer the hyperparameter $\alpha$
using a Gamma prior $\alpha \sim \mathcal{G}(1, 1)$. The likelihood is
$\prod_{k=1}^{K-1} \mathrm{Beta}(v_k \mid 1, \alpha)$.

Initialisation. The DPVP can suffer from the label 
switching problem: although the partitioning at two 
locations might be similar, they may look quite differ- 
ent according to the model if the labels are permuted. 
To alleviate this problem we initialised both MCM and 
ECS using the equivalent DPM model where the same 
clustering is shared across all locations. 
8. Multitask clustering results



Using synthetic data we first demonstrate some en- 
couraging characteristics of our inference method. We 
then apply MCM to two real world biological datasets. 
Unless otherwise stated we use 1000 MCMC iterations, 
discarding the first 500 as burnin. 

8.1. Synthetic data 

We use experiments on synthetic data to demonstrate 
three things. Firstly, on a single location dataset with 
N = 30 objects in three equal sized, well separated 
clusters with means $-3\mathbf{1}$, $\mathbf{0}$, $+3\mathbf{1}$ and equal covariances
$I$ in $D = 5$ dimensions, we show that our Elliptical
Slice Sampling (ESS) based inference is competitive
with standard Gibbs sampling for DPMs (for $T = 1$
MCM is exactly equivalent to a DPM). In Figure 2 we
see that while ESS does sometimes get stuck in a
local mode (two of ten repeats), it typically finds areas 
of higher marginal likelihood than the standard Gibbs 
sampler. 

Secondly, on a dataset with T = 3 data sources, two 
of which have the same clustering structure as the 
first example, and one of which has a single cluster
with mean $\mathbf{0}$ and covariance $I$, we show that sequentially
jointly sampling the $K - 1$ GPs associated with each
datapoint mixes better than attempting to sample
all $N(K - 1)$ GPs and $v$ jointly (Figure 3). The latter
approach attempts to make very large global moves in 
the MCMC state space so many likelihood evaluations 
are required before a new point is accepted by ESS. 

Thirdly, on the same dataset we show we can learn a 
sensible similarity kernel. We use the similarity kernel 
defined in Section 6. We run MCMC for 200 iter- 
ations and, finding the chain appears to have burnt 
in after 100 iterations, calculate 95% credible inter-
vals for $\Sigma$ using the final 100 samples. The correla-
tion coefficient between the two data sources whose
true clusterings are identical has credible interval
[0.31, 0.65], whereas between the distinct data source
and these two the credible intervals are [−0.39, 0.27]
and [−0.37, 0.23].

Figure 2. Comparison of our ESS based inference vs stan-
dard Gibbs sampling for the DPM when there is only T = 1
location.

Figure 3. Comparison of sampling methods for F. Dashed
black: sequentially sampling the K − 1 GPs associated
with each object n, followed by v. Solid blue:
sampling all of F and v jointly. Dashed red: alternating
between these two.

8.2. Cancer cell line encyclopedia

The Cancer cell line encyclopedia (CCLE, Barretina
et al., 2012) is a recently published resource to aid the
understanding of why certain cancer subtypes are re-
sistant to particular drugs. N = 432 cancer cell lines
were grown in the presence of 24 different therapeu-
tic compounds at nine different concentrations, and
their growth measured after 10 days. These growth
curves are summarised in terms of "active area": the
total inhibition of growth summed over all concentra-
tions. Alongside these sensitivity measurements vari-
ous molecular characteristics of the cell lines are mea- 
sured, including gene expression (GE), copy number 
variation (CNV, the number of times a gene is dupli- 
cated in the cancer genome) and oncogene mutations 
(mutations such as insertions, deletions or single nu- 
cleotide polymorphisms, SNPs, in genes known to be 
involved in cancer). We consider two tasks: firstly, pre- 
dicting drug sensitivity, since this is clinically relevant 
(being able to predict sensitivity could help determine 
what drug is most appropriate for a particular patient) 
and secondly, learning how the clustering of cancer cell lines
varies across the four data sources: GE, CNV, onco- 
gene mutations (ONCO) and sensitivity (SENS). 

To assess the predictive performance of MCM on drug 
sensitivity, we randomly choose 20 different sets of 10% 
of the sensitivity measurements to hold out, and at- 
tempt to impute these values. We compare to two 
extremes: using a DP mixture (DPM) model inde- 
pendently on each data source (in this case we need 
only run the algorithm for the sensitivity data source 
since this is what we are interested in imputing), and
a DPM where the clustering is common to all data 
sources. These are extremes of MCM, representing 
minimum or maximum transfer learning respectively. 
The results are shown in Table 1. We see that the in- 
dependent clustering performs very poorly, the shared 
clustering does reasonably well, but MCM performs 
best since it is able to learn how much information to 
transfer from the other data sources to the drug sensi- 
tivity clustering task. The learnt correlation matrix is 
shown in Figure 4: of particular interest are the cor- 
relation coefficients between sensitivity and the other 
data sources. We see there is a positive correlation 
to CNV, whereas the correlation is small (and in fact 
slightly negative) to gene expression and oncogene mu- 
tations, suggesting that copy number variation is the 
most indicative characteristic of which drugs a cancer 
will be sensitive/resistant to. 



Dataset        Independent        Shared             MCM/ECS
CCLE           -3.221 ± 0.552     -1.109 ± 0.069     -0.902 ± 0.097
van de Bunt    -0.530 ± 0.025     -0.502 ± 0.022     -0.095 ± 0.017
HapMap         -1.357 ± 0.055     -1.134 ± 0.013     -1.277 ± 0.016

Table 1. Predictive performance results for both MCM and
ECS on real world datasets. Values are log predictive like-
lihood per heldout data entry.

Figure 4. Correlation matrix learnt for different data
sources in the CCLE dataset.



8.3. HapMap gene expression data 

The HapMap project 1 is primarily an attempt to mea- 
sure genetic variation between 1301 individuals from 
different human populations, but gene expression data 
is also available in 618 individuals (Montgomery et al., 
2010). We consider the task of discovering regulatory 
modules of genes from this gene expression data, but 
rather than simply learning a global clustering we will 
use the MCM to learn population specific clusterings 
of the genes. From around 20,000 genes we filter down
to the 1000 most variable ones. The known tree over 
the different populations is shown in Figure 5. We use 
this tree to define the covariance matrix as described 
in Section 6. 

We again assess predictive performance on 10 heldout 
sets consisting of 10% of data entries. In this case we 
find that although MCM performs better than inde- 
pendent clustering in each population, using a shared 
clustering of genes across all populations performs 
best. This suggests the biological conclusion that gene 
regulatory modules do not vary between diverse hu- 
man populations. 

Figure 5. Tree structure of human populations in
HapMap.

1 http://www.sanger.ac.uk/resources/downloads/human/hapmap3.html

9. Network modelling results

We experimentally evaluate the ECS model on both
synthetic and real-world data.

9.1. Synthetic data

We explore the ability of ECS to recover the parti-
tions in synthetic time series network data. We hand-
constructed a set of six square binary matrices which
encode the friendship links among N = 30 people
evolving through time, as shown in Figure 6. People 
form groups (clusters) which determine the links and 
non-links between them. As time passes, the partition- 
ing of people changes; new friendship links are created 
while others break. The closer in time two snapshots 
are, the more similar we expect the related partitions 
will be. We ran ECS for 200 MCMC iterations and 
the sample with highest marginal likelihood is shown 
in Figure 7. We see that the solution provided by our 
model proposes two clusters at t = 0, which shrink 
as a third cluster is generated between them. The 
partition found at t = 3 is suboptimal: with more it-
erations we might hope the white and yellow clusters
used here would be replaced by the red cluster used at
later time points. However, multiple hypotheses are
of course capable of explaining the same data. 




Figure 6. Synthetic network data. Each matrix represents
(non-)links among pairs of objects. White corresponds to
one (link) and black to zero (non-link).

Figure 7. Learnt clustering for synthetic network data.
Colours denote assignment.

9.2. van de Bunt's students 

In Van De Bunt et al. (1999) 32 university fresh-
men were surveyed at seven time points about who
in their class they considered as friends. The first
four time points were two weeks apart and the last
four were three weeks apart. We binarise the origi-
nal 0-5 scale, taking 'Best friendship', 'Friendship'
and 'Friendly relationship' as 1, 'Neutral relationship', 
'Unknown person' and 'Troubled relationship' as 0, 
and 'non-response' as missing. We also symmetrise 
the matrix by assuming friendship if either individual 
reported it. We test the predictive performance of ECS 
using the squared exponential kernel on this dataset by 
holding out 10% of all links across all time points in 10 
different training/test splits. We compare to indepen- 
dent IRMs at each time point and a model with shared 
clustering but independent link probabilities across all 
time points. The results in Table 1 show that ECS sig- 
nificantly outperforms both these extremes in terms of 
heldout predictive performance. The average length- 
scale learnt was 1.43 weeks, showing that while there 
was similarity in community structure between proxi- 
mal time points there were also significant changes to 
be taken into account over the time course of the full 
dataset. 
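The preprocessing described at the start of this section (binarising the survey responses and symmetrising the result) can be sketched as follows. The sketch works on the response labels quoted in the text rather than numeric codes, since the mapping from codes to labels is not given here. How a pair is resolved when only one of the two reports is missing is an assumption: the observed report is used. Function names are hypothetical.

```python
POSITIVE = {"Best friendship", "Friendship", "Friendly relationship"}
NEGATIVE = {"Neutral relationship", "Unknown person", "Troubled relationship"}

def binarise(label):
    """Map a survey response to 1 (link), 0 (non-link) or None (missing)."""
    if label in POSITIVE:
        return 1
    if label in NEGATIVE:
        return 0
    return None                      # 'non-response' is treated as missing

def symmetrise(B):
    """Link if either individual reports friendship; missing if neither responded."""
    N = len(B)
    S = [[None] * N for _ in range(N)]
    for n in range(N):
        for m in range(N):
            reports = [x for x in (B[n][m], B[m][n]) if x is not None]
            S[n][m] = max(reports) if reports else None
    return S

raw = [["non-response", "Friendship"],
       ["Neutral relationship", "non-response"]]
B = [[binarise(x) for x in row] for row in raw]
S = symmetrise(B)
```

Entries left as `None` are simply skipped when evaluating the likelihood, which is also how held-out links can be handled.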

10. Conclusion 

Given the central role of clustering in unsupervised 
learning we expect the dependent partition-valued
process we introduce here to have many potential ap- 
plications. We have investigated two models derived 
from the DPVP: a multitask clustering model, which
is, to the best of our knowledge, the first such model 
derived under a fully probabilistic framework, and 
a time series network model that is appropriate for 
the many real world networks that constantly evolve
through time. 

Various directions for future work are open. Firstly, 
improved inference is of interest. While we used 
MCMC inference, variational methods would be a 
natural fit for either model: expectation propaga-
tion (Minka, 2001) is known to be particularly effective
for Gaussian process classification (Nickisch & Ras-
mussen, 2008), which is a subcomponent of DPVP, and
variational Bayes (Attias, 2000; Ghahramani & Beal,
2001) is commonly used for mixture modelling. Sec-
ondly, it is possible to have covariates associated with
each object n. Making the f also a function of these co-
variates would give a model related to the distance de-
pendent CRP (Blei & Frazier, 2011), and smart com-
putation of the Cholesky would be only $O(T^3 + N^3)$
rather than the naive $O(T^3 N^3)$. Thirdly, other appli-
cations suggest themselves: modelling spatially vary- 
ing ecological networks or the difference between reg- 
ulatory modules across different human tissue types. 
Finally it would be of considerable interest to derive a 
dependent partition-valued process that, like the FCP,
does not explicitly label clusters, and therefore does 
not suffer from the label switching problem. 